Evaluating OCR and Non - OCR Text
نویسندگان
چکیده
In literature, many feature types and learning algorithms are proposed for document classiication. However , an extensive and systematic evaluation of the various approaches has not been done yet. In order to investigate diierent text representations for document classiication, we have developed a tool which transforms documents into feature-value representations suitable for standard learning algorithms. In this paper we investigate seven document representations for German texts based on n-grams and single words. We compare their eeectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
منابع مشابه
Non-interactive OCR Post-correction for Giga-Scale Digitization Projects
This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up or ticcl (pronounce ’tickle’) focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within the predefined Leven...
متن کاملCvl Ocr Db, an Annotated Image Database of Text in Natural Scenes, and Its Usability
Text detection and optical character recognition (OCR) in images of natural scenes is a fairly new computer vision area but yet very useful in numerous applicative areas. Although many implementations gain promising results, they are evaluated mostly on the private image collections that are very hard or even impossible to get. Therefore, it is very difficult to compare them objectively. Since ...
متن کاملDocument Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)
Document images produced by scanner or digital camera, usually suffer from geometric and photometric distortions. Both of them deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...
متن کاملOCR Post-Processing Error Correction Algorithm Using Google's Online Spelling Suggestion
With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...
متن کاملOCR Post-Processing Error Correction Algorithm using Google Online Spelling Suggestion
With the advent of digital optical scanners, a lot of paper-based books, textbooks, magazines, articles, and documents are being transformed into an electronic version that can be manipulated by a computer. For this purpose, OCR, short for Optical Character Recognition was developed to translate scanned graphical text into editable computer text. Unfortunately, OCR is still imperfect as it occa...
متن کاملEvaluating supervised topic models in the presence of OCR errors
Supervised topic models are promising tools for text analytics that simultaneously model topical patterns in document collections and relationships between those topics and document metadata, such as timestamps. We examine empirically the effect of OCR noise on the ability of supervised topic models to produce high quality output through a series of experiments in which we evaluate three superv...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1997